In this lecture, we will take a look at how to visualize data using the powerful ggplot2 package. We will use ggplot2 a lot throughout the rest of the course!
ggplot2 – a grammar of graphics.
Let’s first load the packages that we need for this chapter.
library("knitr") # for rendering the RMarkdown file
library("tidyverse") # for plotting (and many more cool things we'll discover later)
The tidyverse is a collection of packages that includes ggplot2.
The greatest value of a picture is when it forces us to notice what we never expected to see. — John Tukey
There is no single statistical tool that is as powerful as a well‐chosen graph. (Chambers et al. 1983)
…make both calculations and graphs. Both sorts of output should be studied; each will contribute to understanding. (Anscombe 1973)
Figure 1.1: Anscombe’s quartet.
Anscombe’s quartet in Figure 1.1 (left side) illustrates the importance of visualizing data. Even though datasets I–IV have the same summary statistics (mean, standard deviation, correlation), they differ from each other in important ways. For contrast, the right side shows four datasets that share the same summary statistics and also look very similar to each other.
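To make this concrete, here is a minimal sketch that computes the summary statistics for Anscombe’s quartet and then plots each dataset. It assumes the `anscombe` data frame that ships with base R (columns `x1` to `x4` and `y1` to `y4`); the names `df.anscombe` and `df.summary` are made up for this illustration, and the code is not necessarily how Figure 1.1 was produced.

```r
library("tidyverse")

# reshape from wide (x1..x4, y1..y4) to long format with columns: dataset, x, y
df.anscombe <- anscombe %>%
  pivot_longer(cols = everything(),
               names_to = c(".value", "dataset"),
               names_pattern = "(.)(.)")

# summary statistics for each of the four datasets
df.summary <- df.anscombe %>%
  group_by(dataset) %>%
  summarize(mean_x = mean(x),
            mean_y = mean(y),
            sd_x = sd(x),
            sd_y = sd(y),
            r = cor(x, y))

print(df.summary)

# one scatter plot per dataset, each with a fitted regression line
p <- ggplot(data = df.anscombe,
            mapping = aes(x = x, y = y)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  facet_wrap(vars(dataset))

print(p)
```

The rows of the summary table are nearly identical, while the four panels of the plot look clearly different.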
Figure 1.2: Pearson’s correlation coefficient \(r\) is the same for all of these datasets. Source: Data Visualization – A practical introduction by Kieran Healy.
All the datasets in Figure 1.2 share the same correlation coefficient. Yet, again, they look very different from each other.
Figure 1.3: The Datasaurus Dozen. While different in appearance, each dataset has the same summary statistics to two decimal places (mean, standard deviation, and Pearson’s correlation).
The datasets in Figure 1.3 all share the same summary statistics, yet they are clearly not the same.
Tip: Always plot the data first!
Here is the paper from which I took Figures 1.1 and 1.3. It explains how the figures were generated and shows more examples of how summary statistics and some kinds of plots are insufficient to get a good sense of what’s going on in the data.
Figure 1.4: Animation showing different data sets that all share the same summary statistics.
Below are some examples of visualizations that could be improved. How would you make them better?